
7

The Transmission of Information

a sort of Boltzmann distribution. $C$ is a constant fixed by the condition that $\sum_i p_i = 1$,

and $D$ is an as yet undetermined constant.

Suppose that the words are made up of individual letters (symbols) and demarcated

by a special word demarcation symbol (the space in many languages). Cost, length,

and number of letters are all proportional to each other. If the letters can be chosen in

any way from an alphabet of $A$ different ones, by the multiplication rule (Sect. 8.2.1)

there are $A^n$ different $n$-letter words. Let these words now be ranked in order of

increasing cost and call this rank $r$. Since the cost increases linearly with $n$, it only

increases logarithmically with rank,⁹ that is,

$$c_r = \log_A r . \tag{7.5}$$

Substituting Eq. (7.5) into (7.4), one obtains a power law relation

$$p_r = C r^{-B} , \tag{7.6}$$

known as Zipf’s law when $B = 1$. Mandelbrot has shown that, more precisely, Eq.

(7.6) is

$$p_r = C (r + \rho)^{-B} \tag{7.7}$$

and that the constant $B$ (subsuming $D$ in Eq. 7.4), the reciprocal of the informational

temperature $\theta$ of the distribution (by analogy with the thermodynamic case), can take

values other than 1. For $B > 1$ (i.e., $\theta < 1$), the language is called open (because the

value of $C$ does not greatly depend on the total number of words), whereas for $B < 1$

it does, and the corresponding language is called closed. The constant $\rho$ is connected

with the freedom of choosing words (cf. Sect. ??), but a deep interpretation of its

significance in messages has not yet been given. Equation (7.7) fits the distribution

of written texts remarkably well, and most languages such as English, German,

and so forth are open, whereas highly stylized languages (e.g., modern Hebrew and

the English of the Pennsylvania Dutch) are closed. $\theta$ is a measure of the agility of

exploiting vocabulary; low values are characteristic of children learning a language or

schizophrenic adults; the richest and most imaginative use of vocabulary corresponds

to $\theta = 1$.
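The step from the cost ranking to the power law, and the distinction between open and closed languages, can be checked numerically. The sketch below assumes that Eq. (7.4), which is not reproduced on this page, has the Boltzmann form $p_i = C e^{-D c_i}$; with $c_r = \log_A r$ this gives $p_r = C r^{-D/\ln A}$, that is, $B = D/\ln A$. The function names `zipf_mandelbrot` and `boltzmann_exponent` are illustrative only, and the values of $B$, $\rho$ and the vocabulary sizes are arbitrary choices.

```python
import math


def zipf_mandelbrot(n_words, B, rho=0.0):
    """Return (C, probabilities) for p_r = C (r + rho)^(-B), r = 1..n_words.

    C is fixed by the normalization condition sum_r p_r = 1.
    """
    weights = [(r + rho) ** (-B) for r in range(1, n_words + 1)]
    C = 1.0 / sum(weights)
    return C, [C * w for w in weights]


def boltzmann_exponent(D, A):
    """Exponent B implied by p = C exp(-D c) with c = log_A r.

    Assumption: Eq. (7.4) has this Boltzmann form (not shown in the excerpt).
    """
    return D / math.log(A)


if __name__ == "__main__":
    # Compare how strongly C depends on vocabulary size for B > 1 and B < 1.
    for B in (1.5, 0.8):
        for n in (10_000, 100_000):
            C, p = zipf_mandelbrot(n, B, rho=1.0)
            print(f"B={B}, N={n}: C={C:.5f}, sum of p_r={sum(p):.6f}")
```

For $B > 1$ the printed value of $C$ changes only slightly when the vocabulary grows tenfold, whereas for $B < 1$ it changes substantially, which is the behaviour the text uses to distinguish open from closed languages.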

There are many heuristic methods for compression. Dictionaries (i.e., lists of

frequent words) are often used for word texts. In rastered images, successive lines

typically show small changes; large blocks are uniformly black, grey or white, and so

on. A useful way of compressing long sequences of symbols is to search for segments

that are duplicated. The duplicates can then be encoded by the distance of the match

from the original sequence and the length of the matching sequence (number of

symbols). Zipping software typically works on this principle;¹⁰ the compression is

9 The words are listed in order of increasing cost; rank 1 has the lowest cost and so on.

10 For example, Ziv and Lempel (1977).
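The duplicate-segment idea described in the text, in which a repeated segment is replaced by its distance back to an earlier occurrence and its length, can be sketched as follows. This is a minimal illustration in the spirit of Ziv and Lempel (1977), not the algorithm of any particular zipping program: the triple format (distance, length, next symbol), the brute-force search and the `window` size are assumptions made for brevity, and the names `lz77_encode` and `lz77_decode` are hypothetical.

```python
def lz77_encode(data, window=255):
    """Encode a string as a list of (distance, length, next_symbol) triples."""
    i, out = 0, []
    while i < len(data):
        best_dist, best_len = 0, 0
        # Brute-force search of the preceding window for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            # A match may overlap position i; one symbol is always kept
            # back so that it can be emitted as the literal of the triple.
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_dist, best_len = i - j, length
        out.append((best_dist, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decode(triples):
    """Rebuild the original string from (distance, length, next_symbol) triples."""
    out = []
    for dist, length, symbol in triples:
        start = len(out) - dist
        for k in range(length):
            # Copy symbol by symbol so that overlapping matches work.
            out.append(out[start + k])
        out.append(symbol)
    return "".join(out)


if __name__ == "__main__":
    text = "abracadabra abracadabra abracadabra"
    triples = lz77_encode(text)
    print(len(text), "symbols ->", len(triples), "triples")
    print(lz77_decode(triples) == text)
```

A text with many repeated segments yields far fewer triples than symbols; a practical compressor would additionally pack the triples into a compact bit representation, which this sketch omits.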